Heart disease remains the leading cause of death globally, accounting for millions of deaths annually. The ability to accurately predict heart disease can lead to early diagnosis and intervention, significantly reducing the mortality rate associated with this condition. This project aims to leverage data science and machine learning techniques to analyze and predict the likelihood of heart attacks based on a wide range of factors including age, sex, chest pain type, resting blood pressure, serum cholesterol, and more.
Utilizing a dataset from Kaggle, this analysis will explore various features that may influence heart disease, implement several machine learning models to predict heart disease occurrence, and evaluate the performance of these models. The dataset includes diverse variables, offering a comprehensive insight into the factors that might contribute to heart disease.
Through this project, we aim to uncover significant predictors of heart disease, assess the predictive power of machine learning models in a medical context, and ultimately, contribute to the efforts of preventing heart disease by enabling early detection. This endeavor is not only a technical challenge but also a crucial step towards saving lives and improving health outcomes.
As we delve into the data and models, our goal is to present a clear, concise, and informative analysis that can serve as a foundation for further research and practical applications in the field of medical science and public health.
The Heart Attack Analysis & Prediction Dataset on Kaggle includes data relevant for analyzing and predicting heart attacks. To access this dataset, you can visit the Kaggle dataset page.
The heart is like an efficient pump, tirelessly working to circulate blood throughout the body about 60 to 80 times a minute when we’re at rest. Just like any other part of the body, the heart itself needs a steady supply of nutrients and oxygen, which it gets through its very own set of blood vessels known as coronary arteries.
Sometimes, these crucial arteries can have trouble with their blood flow due to blockages or narrowing, a condition known as coronary insufficiency. This issue can vary widely – it all depends on where the blockage is, how severe it is, and the specific arteries affected.
For some, this might just mean chest pain that pops up during a workout or some physical task, but goes away once they take a break. However, in more serious cases, if a coronary artery gets suddenly completely blocked, it can trigger a heart attack. This often starts with intense chest pain and, in the worst-case scenario, can lead to sudden death.
Variable Definitions in the Dataset
The descriptions below explain each variable and its coding.
age - age in years
sex - sex (1 = male; 0 = female)
cp - chest pain type (0 = asymptomatic; 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain)
trestbps - resting blood pressure (in mm Hg on admission to the hospital)
chol - serum cholesterol in mg/dl
fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg - resting electrocardiographic results (0 = hypertrophy; 1 = normal; 2 = having ST-T wave abnormality)
thalach - maximum heart rate achieved
exang - exercise induced angina (1 = yes; 0 = no)
oldpeak - ST depression induced by exercise relative to rest
slope - the slope of the peak exercise ST segment (0 = downsloping; 1 = flat; 2 = upsloping)
ca - number of major vessels (0-3) colored by fluoroscopy
thal - thallium stress test result (1 = fixed defect; 2 = normal; 3 = reversible defect)
num - the predicted attribute, diagnosis of heart disease (angiographic disease status); named output in this dataset (0 = < 50% diameter narrowing; 1 = > 50% diameter narrowing)
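Although the models below treat these columns numerically, the integer codes can be recoded into labeled factors for readability. A minimal sketch on a toy tibble, assuming the coding above; the same mutate() could be applied to the real data after loading:

```r
library(dplyr)

# Toy data with the same coding scheme as the Kaggle file
# (a hypothetical sketch, not the real dataset).
toy <- tibble::tibble(sex = c(1, 0, 1), cp = c(3, 0, 2))

toy_labeled <- toy %>%
  mutate(
    sex = factor(sex, levels = c(0, 1), labels = c("female", "male")),
    cp  = factor(cp, levels = 0:3,
                 labels = c("asymptomatic", "typical angina",
                            "atypical angina", "non-anginal pain"))
  )
toy_labeled
```

Labeled factors make plots and summaries self-explanatory, at the cost of needing dummy encoding again before modeling.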
library(corrplot) # for the correlation plot
library(discrim) # for linear and quadratic discriminant analysis
library(corrr) # for calculating correlations
library(knitr) # to help with the knitting process
library(MASS) # engine for the LDA and QDA models
library(tidyverse) # tidyverse and tidymodels do most of the work in this project
library(tidymodels)
library(ggplot2) # for most of our visualizations
library(ggrepel) # for non-overlapping text labels
library(rpart.plot) # for visualizing trees
library(vip) # for variable importance
library(janitor) # for cleaning our data
library(ranger) # engine for the random forest
library(dplyr) # for data manipulation verbs
library(yardstick) # for performance metrics
library(naniar) # for visualizing missing data
library(kableExtra) # for styled tables
library(kknn) # engine for k-nearest neighbors
tidymodels_prefer()
Heart_data <- read_csv("data/heart.csv", show_col_types = FALSE)
Heart_data
## # A tibble: 303 × 14
## age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 233 1 0 150 0 2.3 0
## 2 37 1 2 130 250 0 1 187 0 3.5 0
## 3 41 0 1 130 204 0 0 172 0 1.4 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1
## 7 56 0 1 140 294 0 0 153 0 1.3 1
## 8 44 1 1 120 263 0 1 173 0 0 2
## 9 52 1 2 172 199 1 1 162 0 0.5 2
## 10 57 1 2 150 168 0 1 174 0 1.6 2
## # ℹ 293 more rows
## # ℹ 3 more variables: caa <dbl>, thall <dbl>, output <dbl>
vis_miss(Heart_data)
This plot shows us at a glance that there is no missing data in the whole dataset.
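As a quick numeric cross-check of vis_miss(), counting the NAs per column should return all zeros for this dataset. A sketch on toy data (with the real data, pass Heart_data to colSums(is.na(...))):

```r
# Count missing values per column; on Heart_data every count should be 0.
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
colSums(is.na(toy))
```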
unique_number <- sapply(Heart_data, function(x) length(unique(x)))
unique_values_df <- data.frame("Total Unique Values" = unique_number)
rownames(unique_values_df) <- names(Heart_data)
print(unique_values_df)
## Total.Unique.Values
## age 41
## sex 2
## cp 4
## trtbps 49
## chol 152
## fbs 2
## restecg 3
## thalachh 91
## exng 2
## oldpeak 40
## slp 3
## caa 5
## thall 4
## output 2
numeric_var <- c("age", "trtbps", "chol", "thalachh", "oldpeak")
categoric_var <- c("sex", "cp", "fbs", "restecg", "exng", "slp", "caa", "thall", "output")
Heart_data %>%
select(all_of(numeric_var)) %>% # all_of() avoids the tidyselect ambiguity warning
summary()
## age trtbps chol thalachh oldpeak
## Min. :29.00 Min. : 94.0 Min. :126.0 Min. : 71.0 Min. :0.00
## 1st Qu.:47.50 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:133.5 1st Qu.:0.00
## Median :55.00 Median :130.0 Median :240.0 Median :153.0 Median :0.80
## Mean :54.37 Mean :131.6 Mean :246.3 Mean :149.6 Mean :1.04
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :77.00 Max. :200.0 Max. :564.0 Max. :202.0 Max. :6.20
ggplot(Heart_data, aes(x = age)) +
geom_histogram(binwidth = 5, fill = "blue", color = "black") +
ggtitle("Age Distribution") +
xlab("Age") +
ylab("Frequency")
ggplot(Heart_data, aes(x = trtbps)) +
geom_histogram(binwidth = 5, fill = "red", color = "black") +
ggtitle("Resting Blood Pressure (mm Hg) Distribution") +
xlab("Resting blood pressure (mm Hg)") +
ylab("Frequency")
ggplot(Heart_data, aes(x = chol)) +
geom_histogram(binwidth = 5, fill = "purple", color = "black") +
ggtitle("Cholesterol Distribution") +
xlab("Serum cholesterol (mg/dl)") +
ylab("Frequency")
ggplot(Heart_data, aes(x = thalachh)) +
geom_histogram(binwidth = 5, fill = "green", color = "black") +
ggtitle("Maximum Heart Rate Achieved Distribution") +
xlab("Maximum heart rate achieved") +
ylab("Frequency")
ggplot(Heart_data, aes(x = oldpeak)) +
geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
ggtitle("ST Depression (Exercise Relative to Rest) Distribution") +
xlab("ST depression induced by exercise relative to rest") +
ylab("Frequency")
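The five histograms above can also be drawn in a single faceted plot by reshaping the numeric columns to long form. A sketch on synthetic data that reuses the dataset's column names (the values are made up; swap in Heart_data for the real analysis):

```r
library(tidyverse)

# Synthetic stand-in with the dataset's numeric columns (values are invented).
set.seed(1)
demo <- tibble(age = rnorm(100, 54, 9), trtbps = rnorm(100, 131, 17),
               chol = rnorm(100, 246, 52), thalachh = rnorm(100, 150, 23),
               oldpeak = rexp(100, 1))

demo %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions of the Numeric Variables", x = NULL, y = "Frequency")
```

The `scales = "free"` argument lets each panel pick its own axis limits, which matters here because oldpeak and trtbps live on very different scales.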
categoric_var
## [1] "sex" "cp" "fbs" "restecg" "exng" "slp" "caa"
## [8] "thall" "output"
heart_data_summary_sex <- Heart_data %>%
count(sex) %>%
mutate(perc = n / sum(n))
ggplot(heart_data_summary_sex, aes(x = "", y = n, fill = factor(sex))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = c("#FF5733", "#33B5FF")) +
labs(title = "sex (Gender)", fill = "sex") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_cp <- Heart_data %>%
count(cp) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#33B5FF", "#CDDC39", "#9C27B0")
ggplot(heart_data_summary_cp, aes(x = "", y = n, fill = factor(cp))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "cp (Chest Pain type )", fill = "cp") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_fbs <- Heart_data %>%
count(fbs) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#33B5FF")
ggplot(heart_data_summary_fbs, aes(x = "", y = n, fill = factor(fbs))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "fbs (fasting blood sugar )", fill = "fbs") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_restecg <- Heart_data %>%
count(restecg) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#33B5FF", "#CDDC39")
ggplot(heart_data_summary_restecg, aes(x = "", y = n, fill = factor(restecg))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "restecg (resting electrocardiographic results)", fill = "restecg") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_exng <- Heart_data %>%
count(exng) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#CDDC39")
ggplot(heart_data_summary_exng, aes(x = "", y = n, fill = factor(exng))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "exng (exercise induced angina)", fill = "exng") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_slp <- Heart_data %>%
count(slp) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#CDDC39", "#9C27B0")
ggplot(heart_data_summary_slp, aes(x = "", y = n, fill = factor(slp))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "slp (the slope of the peak exercise ST segment)", fill = "slp") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_caa <- Heart_data %>%
count(caa) %>%
mutate(perc = n / sum(n))
ggplot(heart_data_summary_caa, aes(x = "", y = n, fill = factor(caa))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
labs(title = "caa (number of major vessels)", fill = "caa") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_thall <- Heart_data %>%
count(thall) %>%
mutate(perc = n / sum(n))
ggplot(heart_data_summary_thall, aes(x = "", y = n, fill = factor(thall))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
labs(title = "thall (Thallium stress test)", fill = "thall") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_output <- Heart_data %>%
count(output) %>%
mutate(perc = n / sum(n))
ggplot(heart_data_summary_output, aes(x = "", y = n, fill = factor(output))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
labs(title = "output (diagnosis of heart disease)", fill = "output") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
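The nine pie charts above share nearly identical code; a small helper function (a sketch, not part of the original analysis) would remove the duplication:

```r
library(tidyverse)

# Draws one of the percentage pie charts above for any categorical column.
plot_pie <- function(data, var, title) {
  summary_df <- data %>%
    count({{ var }}) %>%
    mutate(perc = n / sum(n))
  ggplot(summary_df, aes(x = "", y = n, fill = factor({{ var }}))) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar(theta = "y") +
    geom_text(aes(label = scales::percent(perc)),
              position = position_stack(vjust = 0.5)) +
    labs(title = title, fill = rlang::as_name(rlang::enquo(var))) +
    theme_void() +
    theme(legend.position = "bottom",
          plot.title = element_text(hjust = 0.5, color = "darkred",
                                    size = 15, face = "bold"))
}

# Example on toy data; with the real data: plot_pie(Heart_data, sex, "sex (Gender)")
toy <- tibble(grp = c(1, 0, 1, 1))
p <- plot_pie(toy, grp, "toy groups")
```

The `{{ var }}` (curly-curly) syntax forwards the unquoted column name into count() and aes(), so one function covers every categorical variable.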
thal_zero_rows <- Heart_data %>% filter(thall == 0)
thal_zero_rows
## # A tibble: 2 × 14
## age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53 0 2 128 216 0 0 115 0 0 2
## 2 52 1 0 128 204 1 1 156 1 1 1
## # ℹ 3 more variables: caa <dbl>, thall <dbl>, output <dbl>
# Replace the invalid code 0 with 2, the most common value in thall
Heart_data$thall[Heart_data$thall == 0] <- 2
unique_thal_categories <- unique(Heart_data$thall)
unique_thal_categories
## [1] 1 2 3
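The same fix can be expressed as a small reusable helper (a sketch; impute_with_mode is a hypothetical function, not from any package) that replaces a sentinel code with the most common remaining value:

```r
# Replace a sentinel code (here 0) with the mode of the remaining values.
impute_with_mode <- function(x, sentinel = 0) {
  mode_val <- as.numeric(names(which.max(table(x[x != sentinel]))))
  x[x == sentinel] <- mode_val
  x
}

# With the real data: Heart_data$thall <- impute_with_mode(Heart_data$thall)
impute_with_mode(c(0, 2, 2, 3, 1, 2))  # the 0 becomes 2, the most common value
```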
vis_miss(Heart_data)
Heart_data_long <- Heart_data %>%
gather(key = "variables", value = "value", trtbps, chol, thalachh)
ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(sex))) +
geom_boxplot() +
labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
x = "Variables",
y = "Value",
fill = "Sex")
ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(cp))) +
geom_boxplot() +
labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
x = "Variables",
y = "Value",
fill = "Cp")
ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(output))) +
geom_boxplot() +
labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
x = "Variables",
y = "Value",
fill = "output")
Heart_data_long_2 <- Heart_data %>%
gather(key = "variables", value = "value",age)
ggplot(Heart_data_long_2, aes(x = variables, y = value, fill = factor(output))) +
geom_boxplot() +
labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
x = "Variables",
y = "Value",
fill = "output")
Heart_data_long_3 <- Heart_data %>%
gather(key = "variables", value = "value",oldpeak)
ggplot(Heart_data_long_3, aes(x = variables, y = value, fill = factor(output))) +
geom_boxplot() +
labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
x = "Variables",
y = "Value",
fill = "output")
cor_mat <- cor(Heart_data)
corrplot(cor_mat, method = "color",
addCoef.col = "black",
tl.cex = 0.8,
number.cex = 0.6,
cl.cex = 0.8,
tl.col = "black",
tl.srt = 45,
order = "hclust" )
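The correlation plot can be complemented by ranking each predictor's correlation with the target directly, using corrr (loaded earlier). A sketch on synthetic data; with the real data the same pipeline would be correlate(Heart_data) %>% focus(output):

```r
library(corrr)
library(dplyr)

# Synthetic example: x1 is constructed to correlate strongly with the target.
set.seed(1)
demo <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
demo$output <- demo$x1 + rnorm(50, sd = 0.1)

correlate(demo, quiet = TRUE) %>%
  focus(output) %>%
  arrange(desc(abs(output)))
```

focus() keeps only each predictor's correlation with output, which makes the "which variables matter most" question easier to answer than scanning the full matrix.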
#### 4.2.3.1 Analysis Outputs
Based on the correlation plot, chol, fbs, and restecg show the weakest association with output, so we drop them before modeling.
Heart_data <- Heart_data %>% select(-c(chol, fbs, restecg))
Heart_data
## # A tibble: 303 × 11
## age sex cp trtbps thalachh exng oldpeak slp caa thall output
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 150 0 2.3 0 0 1 1
## 2 37 1 2 130 187 0 3.5 0 0 2 1
## 3 41 0 1 130 172 0 1.4 2 0 2 1
## 4 56 1 1 120 178 0 0.8 2 0 2 1
## 5 57 0 0 120 163 1 0.6 2 0 2 1
## 6 57 1 0 140 148 0 0.4 1 0 1 1
## 7 56 0 1 140 153 0 1.3 1 0 2 1
## 8 44 1 1 120 173 0 0 2 0 3 1
## 9 52 1 2 172 162 0 0.5 2 0 3 1
## 10 57 1 2 150 174 0 1.6 2 0 2 1
## # ℹ 293 more rows
Now we’re going to walk through the process of splitting the Heart_data dataset into two: a training set and a test set.
set.seed(1234)
heart_split <- initial_split(Heart_data, prop = 0.80,
strata = output)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)
# Convert the outcome to a factor in each copy of the data
# (ideally this would be done once, before splitting)
Heart_data$output <- as.factor(Heart_data$output)
heart_train$output <- as.factor(heart_train$output)
heart_test$output <- as.factor(heart_test$output)
# make the recipe
heart_recipe <- recipe(output ~., data=heart_train) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_center(all_predictors(), -all_outcomes()) %>%
step_scale(all_predictors(), -all_outcomes())
prep(heart_recipe) %>%
bake(new_data = heart_train) %>%
head() %>%
kable() %>%
kable_styling(full_width = F)
| age | sex | cp | trtbps | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.3490876 | 0.6494454 | -0.9444162 | 1.7141174 | -1.7727510 | 1.4069260 | 0.3741999 | -0.7014764 | 2.1413378 | -0.520390 | 0 |
| 1.3490876 | 0.6494454 | -0.9444162 | -0.6572784 | -0.8778349 | 1.4069260 | 1.3051534 | -0.7014764 | 1.1870031 | 1.204738 | 0 |
| 0.8209750 | -1.5334127 | -0.9444162 | 0.5284195 | 0.4432318 | -0.7078324 | 2.1514748 | -2.3657636 | 1.1870031 | -0.520390 | 0 |
| 0.9265975 | 0.6494454 | -0.9444162 | -0.0644295 | -0.1107639 | -0.7078324 | 0.2895678 | -0.7014764 | 0.2326684 | 1.204738 | 0 |
| -0.1296276 | 0.6494454 | -0.9444162 | 0.5284195 | 0.2301565 | 1.4069260 | 1.7283141 | -2.3657636 | -0.7216663 | 1.204738 | 0 |
| 0.1872399 | 0.6494454 | 1.0006793 | -0.0644295 | -0.3238392 | 1.4069260 | -0.3874893 | -0.7014764 | 0.2326684 | -2.245518 | 0 |
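A quick sanity check that step_center() and step_scale() behaved as intended: every numeric predictor in the baked data should have mean near 0 and standard deviation near 1. A self-contained sketch with a toy recipe (for the real objects, bake heart_recipe and apply the same sapply()):

```r
library(recipes)

set.seed(1)
toy <- data.frame(y = factor(sample(0:1, 40, replace = TRUE)),
                  a = rnorm(40, 100, 15), b = runif(40, 0, 10))

rec <- recipe(y ~ ., data = toy) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

baked <- prep(rec) %>% bake(new_data = NULL)

# Each predictor should now be standardized: mean ~ 0, sd ~ 1.
round(sapply(baked[c("a", "b")], function(x) c(mean = mean(x), sd = sd(x))), 6)
```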
heart_folds <- vfold_cv(heart_train, v = 5)
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
# tune neighbors from 1 to 10
neighbors_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)
knn_workflow <- workflow() %>%
add_model(knn_spec) %>%
add_recipe(heart_recipe)
knn_tune_res_class <- tune_grid(
object = knn_workflow,
resamples = heart_folds,
grid = neighbors_grid
)
# For k-NN model
knn_metrics_class <- knn_tune_res_class %>%
collect_metrics()
knn_auc <- knn_metrics_class %>% filter(.metric == "roc_auc") # Filter for AUC metric
# Print AUC results
print(knn_auc)
## # A tibble: 10 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 roc_auc binary 0.801 5 0.0219 Preprocessor1_Model01
## 2 2 roc_auc binary 0.857 5 0.0151 Preprocessor1_Model02
## 3 3 roc_auc binary 0.876 5 0.0148 Preprocessor1_Model03
## 4 4 roc_auc binary 0.882 5 0.0145 Preprocessor1_Model04
## 5 5 roc_auc binary 0.884 5 0.0157 Preprocessor1_Model05
## 6 6 roc_auc binary 0.887 5 0.0148 Preprocessor1_Model06
## 7 7 roc_auc binary 0.888 5 0.0138 Preprocessor1_Model07
## 8 8 roc_auc binary 0.891 5 0.0120 Preprocessor1_Model08
## 9 9 roc_auc binary 0.893 5 0.0124 Preprocessor1_Model09
## 10 10 roc_auc binary 0.893 5 0.0130 Preprocessor1_Model10
From the results above, the models with 9 and 10 neighbors achieve the highest mean AUC, about 0.893. The standard error for the 10-neighbor model is also relatively low, about 0.013, which suggests that its performance is stable across the different folds of the data. Based on this, the model with 10 neighbors appears to perform best in terms of the area under the ROC curve: it ties for the highest mean AUC and has a low standard error, indicating strong and stable predictive performance.
best_neighbors_class <- select_by_one_std_err(knn_tune_res_class, desc(neighbors), metric = "roc_auc")
final_wf_knn_class <- finalize_workflow(knn_workflow, best_neighbors_class)
final_fit_knn_class <- fit(final_wf_knn_class, heart_train)
augment(final_fit_knn_class, new_data = heart_test) %>%
roc_auc(output, .pred_0)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.934
The output tibble shows a ROC AUC of 0.9339 on the test set, indicating that the model also has strong classification ability on unseen data. It is noteworthy that the ROC AUC value on the test set is even higher than the average value in cross-validation, which is a very positive result. However, caution should be taken to consider the possibility of overly optimistic evaluations or data leakage.
Overall, this result should provide a high level of confidence, indicating that the model is likely to perform well in practical applications.
We can calculate accuracies on the training data:
knn_acc_train <- augment(final_fit_knn_class, new_data = heart_train) %>%
accuracy(truth = output, estimate = .pred_class)
knn_acc_train
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.913
We can calculate accuracies on the testing data:
knn_acc_test <- augment(final_fit_knn_class, new_data = heart_test) %>%
accuracy(truth = output, estimate = .pred_class)
knn_acc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.836
The model’s accuracy on the training data is approximately 91.32%, while on the testing data, it is around 83.61%. These figures indicate that the KNN model is performing quite well, although it exhibits some degree of overfitting, as indicated by the higher accuracy on the training data compared to the testing data.
log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_wkflow <- workflow() %>%
add_model(log_reg) %>%
add_recipe(heart_recipe)
log_fit_results <- fit(log_wkflow,heart_train)
log_fit_results %>% tidy()
## # A tibble: 11 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.143 0.198 0.723 0.470
## 2 age -0.0771 0.225 -0.342 0.732
## 3 sex -0.666 0.226 -2.95 0.00319
## 4 cp 0.840 0.203 4.13 0.0000357
## 5 trtbps -0.403 0.206 -1.95 0.0508
## 6 thalachh 0.503 0.252 2.00 0.0456
## 7 exng -0.436 0.209 -2.09 0.0370
## 8 oldpeak -0.843 0.278 -3.03 0.00245
## 9 slp 0.195 0.247 0.790 0.429
## 10 caa -0.773 0.214 -3.62 0.000300
## 11 thall -0.456 0.189 -2.41 0.0159
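Because the recipe centered and scaled every predictor, each coefficient is the change in log-odds per one standard deviation of that predictor; exponentiating turns it into an odds ratio. A small sketch using two of the estimates printed above (with the real fit object the same idea is log_fit_results %>% tidy() %>% mutate(odds_ratio = exp(estimate))):

```r
# Odds ratios for two coefficients from the table above.
exp(0.840)   # cp: each 1-SD increase multiplies the odds of disease by ~2.32
exp(-0.843)  # oldpeak: each 1-SD increase multiplies the odds by ~0.43
```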
We can use this model to generate probability predictions for the training data:
predict(log_fit_results, new_data = heart_train, type = "prob")
## # A tibble: 242 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.995 0.00520
## 2 0.991 0.00919
## 3 0.911 0.0888
## 4 0.879 0.121
## 5 0.976 0.0243
## 6 0.306 0.694
## 7 0.348 0.652
## 8 0.817 0.183
## 9 0.991 0.00942
## 10 0.933 0.0666
## # ℹ 232 more rows
However, it’s more useful to summarize the predicted values. We can use augment() to attach the predictions to the data, then generate a confusion matrix for the training set:
augment(log_fit_results, new_data = heart_train) %>%
conf_mat(truth = output, estimate = .pred_class) %>%
autoplot(type = "heatmap")
We can calculate accuracies on the training data:
log_acc_train <- augment(log_fit_results, new_data = heart_train) %>%
accuracy(truth = output, estimate = .pred_class)
log_acc_train
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.851
Let’s calculate the accuracy of the logistic regression model, that is, the proportion of correct predictions it made on the testing data.
augment(log_fit_results, new_data = heart_test) %>%
conf_mat(truth = output, estimate = .pred_class) %>%
autoplot(type = "heatmap")
We can calculate accuracies on the testing data:
log_acc_test <- augment(log_fit_results, new_data = heart_test) %>%
accuracy(truth = output, estimate = .pred_class)
log_acc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.852
log_roc_train <- augment(log_fit_results, new_data = heart_train) %>%
roc_auc(output, .pred_0)
log_roc_train
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.919
log_roc_test <- augment(log_fit_results, new_data = heart_test) %>%
roc_auc(output, .pred_0)
log_roc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.930
Let’s look at an ROC curve on the testing data:
augment(log_fit_results, new_data = heart_test) %>%
roc_curve(output, .pred_0) %>%
autoplot()
# Specify the linear discriminant analysis model
lda_model <- discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification")
# Create the workflow with the LDA model
lda_workflow <- workflow() %>%
add_model(lda_model) %>%
add_recipe(heart_recipe)
# Fit the LDA model
lda_fit_results <- fit(lda_workflow, data = heart_train)
augment(lda_fit_results, new_data = heart_train) %>%
conf_mat(truth = output, estimate = .pred_class) %>%
autoplot(type = "heatmap")
We can calculate accuracies on the training data:
lda_acc_train <- augment(lda_fit_results, new_data = heart_train) %>%
accuracy(truth = output, estimate = .pred_class)
lda_acc_train
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.843
We can calculate accuracies on the testing data:
lda_acc_test <- augment(lda_fit_results, new_data = heart_test) %>%
accuracy(truth = output, estimate = .pred_class)
lda_acc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.852
Interestingly, the accuracy on the testing data is slightly greater than on the training data, which suggests the LDA model generalizes well rather than overfitting.
lda_roc_test <- augment(lda_fit_results, new_data = heart_test) %>%
roc_auc(output, .pred_0)
lda_roc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.930
# Specify the quadratic discriminant analysis model
qda_model <- discrim_quad() %>%
set_engine("MASS") %>%
set_mode("classification")
# Create the workflow with the QDA model
qda_workflow <- workflow() %>%
add_model(qda_model) %>%
add_recipe(heart_recipe)
qda_fit_results <- fit(qda_workflow, data = heart_train)
augment(qda_fit_results, new_data = heart_train) %>%
conf_mat(truth = output, estimate = .pred_class) %>%
autoplot(type = "heatmap")
We can calculate accuracies on the training data:
qda_acc_train <- augment(qda_fit_results, new_data = heart_train) %>%
accuracy(truth = output, estimate = .pred_class)
qda_acc_train
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.864
We can calculate accuracies on the testing data:
qda_acc_test <- augment(qda_fit_results, new_data = heart_test) %>%
accuracy(truth = output, estimate = .pred_class)
qda_acc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.820
These results suggest that the model performs well on both the training and unseen testing data, although there is a slight decrease in performance on the testing data, which is common as models generally perform slightly better on the data they were trained on.
qda_roc_test <- augment(qda_fit_results, new_data = heart_test) %>%
roc_auc(output, .pred_0)
qda_roc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.887
en_spec_lm_class <- logistic_reg(mixture = tune(),
penalty = tune()) %>%
set_mode("classification") %>%
set_engine("glmnet")
# penalty() is tuned on a log10 scale, so range c(0, 1) spans penalties from 1 to 10
en_grid <- grid_regular(penalty(range = c(0, 1)),
mixture(range = c(0, 1)),
levels = 10)
en_workflow_lm_class <- workflow()%>%
add_model(en_spec_lm_class) %>%
add_recipe(heart_recipe)
en_tune_res_class <- tune_grid(
object = en_workflow_lm_class,
resamples = heart_folds,
grid = en_grid
)
# For elastic net logistic regression model
en_metrics_class <- en_tune_res_class %>%
collect_metrics()
en_auc <- en_metrics_class %>% filter(.metric == "roc_auc")
# Print AUC results
print(en_auc)
## # A tibble: 100 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 0 roc_auc binary 0.887 5 0.0169 Preprocessor1_Model001
## 2 1.29 0 roc_auc binary 0.888 5 0.0177 Preprocessor1_Model002
## 3 1.67 0 roc_auc binary 0.888 5 0.0177 Preprocessor1_Model003
## 4 2.15 0 roc_auc binary 0.888 5 0.0177 Preprocessor1_Model004
## 5 2.78 0 roc_auc binary 0.890 5 0.0179 Preprocessor1_Model005
## 6 3.59 0 roc_auc binary 0.889 5 0.0178 Preprocessor1_Model006
## 7 4.64 0 roc_auc binary 0.889 5 0.0182 Preprocessor1_Model007
## 8 5.99 0 roc_auc binary 0.889 5 0.0183 Preprocessor1_Model008
## 9 7.74 0 roc_auc binary 0.888 5 0.0182 Preprocessor1_Model009
## 10 10 0 roc_auc binary 0.888 5 0.0182 Preprocessor1_Model010
## # ℹ 90 more rows
best_en_class <- select_best(en_tune_res_class, metric = "roc_auc")
en_wf_final_class <- finalize_workflow(en_workflow_lm_class, best_en_class)
# Fit the finalized workflow to the training set
final_fit_en_class <- fit(en_wf_final_class, data = heart_train)
We can calculate accuracies on the training data:
en_acc_train <- augment(final_fit_en_class, new_data = heart_train) %>%
accuracy(truth = output, estimate = .pred_class)
en_acc_train
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.793
We can calculate accuracies on the testing data:
en_acc_test <- augment(final_fit_en_class, new_data = heart_test) %>%
accuracy(truth = output, estimate = .pred_class)
en_acc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.754
en_roc_test <- augment(final_fit_en_class, new_data = heart_test) %>%
roc_auc(output, .pred_0)
en_roc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.933
tree_spec <- decision_tree(cost_complexity = tune()) %>%
set_engine("rpart") %>%
set_mode("classification")
tree_wf <- workflow() %>%
add_model(tree_spec) %>%
add_recipe(heart_recipe)
param_grid <- grid_regular(cost_complexity(range = c(-5, -1)), levels = 10)
tune_tree <- tune_grid(
tree_wf,
resamples = heart_folds,
grid = param_grid
)
autoplot(tune_tree)
best_complexity <- select_best(tune_tree) # with no metric given, the first metric (accuracy) is used
tree_final <- finalize_workflow(tree_wf, best_complexity)
tree_final_fit <- fit(tree_final, data = heart_train)
best_complexity
## # A tibble: 1 × 2
## cost_complexity .config
## <dbl> <chr>
## 1 0.00001 Preprocessor1_Model01
We can extract the model fit results and view them with the rpart.plot() function:
tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
We can calculate accuracies on the training data:
tree_acc_train <- augment(tree_final_fit, new_data = heart_train) %>%
accuracy(truth = output, estimate = .pred_class)
tree_acc_train
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.864
We can calculate accuracies on the testing data:
tree_acc_test <- augment(tree_final_fit, new_data = heart_test) %>%
accuracy(truth = output, estimate = .pred_class)
tree_acc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.787
The accuracy on the training data is approximately 86.36%, and it decreases to about 78.69% on the testing data. This drop in accuracy from training to testing suggests that the decision tree model might be overfitting to the training data or not generalizing well to new, unseen data.
tree_roc_test <- augment(tree_final_fit, new_data = heart_test) %>%
roc_auc(output, .pred_0)
tree_roc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.827
rf_class_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
rf_class_wf <- workflow() %>%
add_model(rf_class_spec) %>%
add_recipe(heart_recipe)
rf_grid <- grid_regular(mtry(range = c(1, 10)),
trees(range = c(200, 600)),
min_n(range = c(10, 20)),
levels = 5)
rf_tune_class <- tune_grid(
rf_class_wf,
resamples = heart_folds,
grid = rf_grid
)
save(rf_tune_class, file = "rf_tune_class.rda")
load("rf_tune_class.rda")
autoplot(rf_tune_class) + theme_minimal()
best_rf_class <- select_best(rf_tune_class) # with no metric given, the first metric (accuracy) is used
best_rf_class
## # A tibble: 1 × 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 1 500 15 Preprocessor1_Model066
final_rf_model <- finalize_workflow(rf_class_wf, best_rf_class)
final_rf_model <- fit(final_rf_model, heart_train)
final_rf_model %>% extract_fit_parsnip() %>%
vip() +
theme_minimal()
We can calculate accuracies on the training data:
rf_acc_train <- augment(final_rf_model, new_data = heart_train) %>%
accuracy(truth = output, estimate = .pred_class)
rf_acc_train
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.913
We can calculate accuracies on the testing data:
rf_acc_test <- augment(final_rf_model, new_data = heart_test) %>%
accuracy(truth = output, estimate = .pred_class)
rf_acc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.902
The model’s accuracy is reported to be approximately 91.32% on the training data, with a slight decrease to about 90.16% on the testing data. These results indicate that the random forest model has a strong performance and generalizes well to unseen data, with only a small drop in accuracy between the training and testing sets, which suggests a good balance between bias and variance.
rf_roc_test <- augment(final_rf_model, new_data = heart_test) %>%
roc_auc(output, .pred_0)
rf_roc_test
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.943
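Before the written summary, it helps to collect the test-set metrics reported above into a single table (values transcribed from the outputs earlier in this report; a convenience summary, not a new computation):

```r
library(dplyr)
library(tibble)

# Test-set results transcribed from the model outputs above.
results <- tribble(
  ~model,                 ~test_accuracy, ~test_roc_auc,
  "KNN",                   0.836, 0.934,
  "Logistic Regression",   0.852, 0.930,
  "LDA",                   0.852, 0.930,
  "QDA",                   0.820, 0.887,
  "Elastic Net",           0.754, 0.933,
  "Pruned Decision Tree",  0.787, 0.827,
  "Random Forest",         0.902, 0.943
)

results %>% arrange(desc(test_roc_auc))
```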
Within the scope of this project, I:
prepared the dataset for Exploratory Data Analysis (EDA);
performed EDA, analyzing the numerical and categorical variables;
prepared the dataset for modeling;
used seven different algorithms in the modeling phase;
got 83.6% accuracy and 93.4% AUC with the KNN model;
got 85.2% accuracy and 93.0% AUC with the Logistic Regression model;
got 85.2% accuracy and 93.0% AUC with the LDA (linear discriminant analysis) model;
got 82.0% accuracy and 88.7% AUC with the QDA (quadratic discriminant analysis) model;
got 75.4% accuracy and 93.3% AUC with the Elastic Net (regularized logistic regression) model;
got 78.7% accuracy and 82.7% AUC with the Pruned Decision Tree model;
got 90.2% accuracy and 94.3% AUC with the Random Forest model.
Evaluating all of these results, I prefer the Random Forest model, which gives the best overall performance: the highest test accuracy and the highest test AUC.